Assay methods and compositions for detecting contamination of nucleic acid identifiers

ABSTRACT

The present invention relates to nucleic acid samples for massively parallel sequencing. More particularly, the present invention relates to assay methods, compositions and kits for detecting contamination of nucleic acid identifiers such as sample barcodes.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/792,813, filed on Feb. 17, 2020, which is a divisional of U.S. patent application Ser. No. 15/645,085, filed on Jul. 10, 2017, now U.S. Pat. No. 10,633,651, the contents of all of which are fully incorporated herein by reference.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Jun. 28, 2021, is named 20170066-07_Sequence_Listing.txt, and is 1,914 bytes in size.

FIELD OF THE INVENTION

The present invention relates to the field of molecular biology. In particular, the present invention relates to assay methods and compositions for detecting contamination of nucleic acid identifiers such as sample barcodes.

BACKGROUND OF THE INVENTION

Identifiers (e.g., sample barcodes or molecular barcodes) can be present in nucleic acids for a variety of purposes. Most commonly, sample barcodes are added to target nucleic acid molecules prior to the amplification and/or sequencing of such molecules, so that the origin or source of sequence information can be identified. Nucleic acid molecules from different samples can be pooled together and subjected to massively parallel sequencing in order to efficiently determine sequence information from numerous different samples. Prior to sequencing, sample identifiers (often referred to as sample barcodes) can be added to the nucleic acid molecules, and this facilitates grouping, analysis, and interpretation of information. As another example, molecular barcodes can be added to target nucleic acid molecules prior to amplification, so that the replicates of the initial target molecule can subsequently be identified and grouped together.

Sample barcodes are frequently used with target molecules that will be analyzed by massively parallel sequencing, so that nucleic acid molecules from different samples can be pooled for sequencing, and the sequence information can be assigned to a sample. Scientists and laboratories that perform massively parallel sequencing occasionally detect a sample barcode in a pool even when this sample barcode was not included in the sequencing pool. This indicates that a contaminating sample barcode is present in the pooled nucleic acids, which may be caused by a sample barcode aliquot containing more than one sample barcode sequence, namely the expected barcode sequence and the contaminating barcode sequence. Contaminating barcodes could be introduced at any stage of the preparation of sample barcode aliquots, beginning from the earliest stage, including the synthesis and purification of DNA oligos, or through handling steps in the process of diluting and aliquoting sample barcode sequences. Even when present at low frequencies, such as 1% or lower, the presence of contaminating sample barcodes can create problems with regard to the reliability and interpretation of the sequence information.

Sample barcodes are often provided in a set of containers, such as a well plate, where each container holds a different sample barcode. When the sample barcodes are used in laboratory analysis, such as by pipetting the sample barcodes from their containers to the various samples to be analyzed, there is a risk that a container or sample may become contaminated.

Contamination of sample barcodes could be detected by preparing individual sequencing libraries for each sample barcode and sequencing them individually. Alternatively, contamination could be detected with a pooling scheme that provides the ability to compare a sample barcode and contamination of another sample barcode in at least one of the pools. However, a large number of pools would have to be prepared and sequenced in separate sequencing runs in order to isolate sample barcodes from a large number of samples, such as 48 or 96 samples. This would be expensive, inefficient and time-consuming. It also risks erroneously reporting contamination that was not present in the sample barcode tube but was instead introduced during one of the many library preparation steps, leading to false positives.

SUMMARY OF THE INVENTION

As one aspect of the present invention, methods are provided for attaching assay identifiers (e.g., quality control barcodes) to a set of oligonucleotide samples comprising oligonucleotides, where each oligonucleotide comprises a 5′ constant region, a sample identifier (e.g., a sample barcode), and a 3′ constant region, and each sample identifier is unique in the set in the absence of contamination. In some embodiments, the constant regions comprise standard amplification regions for a sequencing platform, or their reverse complement. For example, in some embodiments, the 5′ constant region is an Illumina Index 1 sequence and the 3′ constant region is the reverse complement of the Illumina P7 sequence (P7′), and in other embodiments, the orientation is reversed such that the 5′ constant region is an Illumina P7 sequence and the 3′ constant region is an Illumina Read 2 sequence. The methods comprise providing each of the oligonucleotide samples of the set in a separate vessel, so that each vessel comprises only one sample identifier unless one or more of the samples is contaminated. The methods also comprise amplifying the oligonucleotides with an assay primer and a second primer in each vessel. Assay primers comprise one or more constant regions (such as P5 and a Read 1 Primer sequence), an assay identifier, and a priming portion that is the same as or complementary to one of the constant regions of the oligonucleotides. Each vessel comprises only one assay identifier unless one or more of the assay primers are contaminated. The method thus provides oligonucleotide amplicons comprising an assay identifier and a sample identifier.

As another aspect, methods are provided for detecting contamination in a set of oligonucleotides comprising sample identifiers. The methods comprise providing a set of oligonucleotide samples comprising oligonucleotides, each oligonucleotide having a 5′ constant region, a sample identifier (such as a sample barcode), and a 3′ constant region. Oligonucleotides within a sample have the same sample identifier and each of the samples within the set has a different sample identifier, unless one or more of the samples is contaminated. The methods also comprise amplifying the oligonucleotides or complements of the oligonucleotides with assay primers and a second primer. A different assay primer is used for each sample, and each assay primer comprises a priming portion and an assay identifier (such as a QC barcode), thereby generating a set of oligonucleotide amplicons. Each oligonucleotide amplicon comprises one of the assay identifiers, the 5′ constant region, one of the sample identifiers, and the 3′ constant region. The methods also comprise pooling the oligonucleotide amplicons in one or more pools; sequencing the one or more pools to determine sequence information for at least the sample identifier and the assay identifier of the oligonucleotide amplicons; determining whether the sample identifiers in a first pool include a contaminating sample identifier; and determining whether the assay identifiers in the first pool include a contaminating assay identifier.

In some embodiments, the present methods comprise pooling the oligonucleotide amplicons in at least two pools, and separately sequencing the first pool and the second pool to determine sequences for at least the sample identifier and the assay identifier of the oligonucleotide amplicons. The present methods can also comprise determining whether the sample identifiers in the second pool include a contaminating sample identifier. In some embodiments, the present methods also comprise determining whether the assay identifiers in the second pool include a contaminating assay identifier. In some embodiments, the present methods further comprise identifying a contaminating sample identifier in a first pool by determining that the contaminating sample identifier is from a second pool. In some embodiments, the present methods further comprise identifying a contaminating sample identifier in a first pool by determining that the second pool does not include a contaminating assay identifier. In some embodiments, the present methods further comprise identifying a contaminating assay identifier in a first pool by determining that the second pool includes a contaminating assay identifier. In some embodiments, the contaminating sample identifier is determined by one or both of (i) identifying one or more of the sample identifiers that are associated with more than one assay identifier, and (ii) identifying assay identifiers that are associated with more than one sample identifier.

As another aspect, compositions are provided which are useful in assays adapted for determining contamination in a set of oligonucleotides comprising sample identifiers. The compositions comprise at least one oligonucleotide having a 5′ constant region, a sample identifier (such as a sample barcode), and a 3′ constant region, and at least one assay primer comprising a priming portion and an assay identifier. In some embodiments, the compositions further comprise one or more of a DNA polymerase and deoxynucleotides.

As yet another aspect, kits are provided for assays adapted for determining contamination in a set of oligonucleotides comprising sample identifiers. The kits comprise at least 8 assay primers, alternatively at least 16 assay primers, alternatively at least 32 assay primers, alternatively at least 48 assay primers or at least 96 assay primers, in separate vessels. Each assay primer comprises a priming portion and an assay identifier.

In some embodiments of the foregoing aspects, a set or pool of oligonucleotide samples comprises at least 8 samples, alternatively at least 16 samples, alternatively at least 32 samples, alternatively at least 48 samples, alternatively at least 96 samples, where each sample has a sample identifier that is unique within the set or pool. In some embodiments, a set of assay primers comprises at least 32 assay identifiers, alternatively at least 48 assay identifiers, alternatively at least 96 assay identifiers, where each assay primer has an assay identifier that is unique within the set or pool.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A, 1B and 1C show embodiments of the present methods of attaching an assay identifier to an oligonucleotide having a sample identifier.

FIG. 2 shows the sequences of two different embodiments of assay primers according to the present disclosure. The two embodiments contain many of the same regions, but the 5′ constant regions are different. In version 2, there is less overlap between the 5′ constant region and the 3′ constant region.

FIG. 3 shows the distribution of amplicon sizes from amplification of an oligonucleotide using the assay primer of the first embodiment in FIG. 2.

FIG. 4 shows the distribution of amplicon sizes from amplification of an oligonucleotide using the assay primer of the second embodiment in FIG. 2.

FIG. 5 shows another embodiment of the present methods of attaching an assay identifier to an oligonucleotide having a sample identifier, where the assay identifier is attached at a 3′ location relative to the sample identifier.

FIG. 6 shows another embodiment of the present methods of attaching an assay identifier to an oligonucleotide having a sample identifier, where the constant regions of the oligonucleotide are not compatible with a desired sequencing platform.

FIG. 7 shows a pooling scheme for detecting contamination of sample identifiers using the present methods and compositions.

DETAILED DESCRIPTION OF THE INVENTION

The present methods, compositions and kits are useful for detecting contamination in a set of oligonucleotides for nucleic acid samples and allow the production of sample identifier sets that are substantially free of contamination. This is a significant advance and benefit, as the presence of sample barcode contamination may result in false calling of genetic variants, which can have severe consequences for research and clinical applications.

The methods, compositions and kits employ oligonucleotides which have a 5′ constant region, a sample identifier, and a 3′ constant region. Each of the oligonucleotides within a sample has the same sample identifier and each of the samples within the set has a different sample identifier, unless one or more of the samples is contaminated by a contaminating sample identifier. In some embodiments, each of the samples within the set has a sample identifier which is unique in the set, meaning that it is intended to be and will be unique in the absence of contamination.

A “sample identifier” comprises a sample barcode or any degenerate or random sequence that can be used to identify a sample. Sample identifiers may be flanked (directly or indirectly) by constant regions. In some embodiments, the sample identifier can be a sample barcode comprising 6 or more random or degenerate nucleotides; alternatively the sample identifier can be a sample barcode comprising 8 or more random or degenerate nucleotides, or 10 or more random or degenerate nucleotides. In some embodiments, a sample identifier comprises 8 known bases, and an assay identifier comprises 10 degenerate bases. In other embodiments, a sample identifier comprises 4 known bases or 6 known bases. In some embodiments, the number of bases in the sample identifier can be selected based on the number of samples to be distinguished. Longer sample identifiers and sample barcodes are also possible. For example, a sample identifier comprising 18 bases (8 known bases and 10 degenerate bases) has been employed to prepare a library of oligonucleotides for an Ion Torrent sequencing platform. A sample identifier with more than 19 bases is also feasible and may be desired, especially if the assay is used for other sequencing platforms and applications. In some embodiments, the complement of an initial sample barcode is in an oligonucleotide amplicon, and this complement is also considered a sample identifier.
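By way of a non-limiting illustration, the relationship between identifier length and the number of samples that can be distinguished can be sketched as follows. The function names are hypothetical and are not part of the disclosure, and the calculation gives only the theoretical minimum for a four-letter alphabet; it assumes no allowance for sequencing errors or for spacing between identifier sequences, which practical designs may require.

```python
def max_distinct_identifiers(length: int) -> int:
    """Number of distinct sequences of `length` bases over A, C, G and T."""
    return 4 ** length


def min_identifier_length(n_samples: int) -> int:
    """Smallest length whose sequence space can hold n_samples identifiers.

    Theoretical lower bound only (assumption: no error tolerance is needed).
    """
    length = 1
    while max_distinct_identifiers(length) < n_samples:
        length += 1
    return length


if __name__ == "__main__":
    for n in (8, 48, 96):
        print(n, "samples need at least", min_identifier_length(n), "bases")
```

Under these assumptions, a set of 96 samples needs at least 4 bases, since 3 bases give only 64 distinct sequences while 4 bases give 256.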

A “constant” region is one that comprises a known sequence, and because it is known, it can serve a desired function. A constant region will generally be the same or substantially the same among oligonucleotides of a set. The known sequence can serve as a priming site (region) for amplification or primer extension, and/or can hybridize to a nucleic acid attached to a support. In some embodiments, a constant region comprises a sequence of a standard region, such as a standard amplification region used in a sequencing platform. A constant region can comprise a number of nucleotides from a known or standard region sufficient for the function of the standard region, such as a sufficient number of nucleotides to hybridize to a standard primer for amplification.

A “contaminating” molecule or sequence is one that is not designed to be in a set or pool, or should not be present in a set or pool or sample unless there is some contamination. For example, a barcode in a first set or pool of sequences is a contaminating barcode if it should not be present in the first set or pool and/or should only be present in a second set or pool.

The present methods and compositions provide a solution to the problem of identifying contamination in sets of oligonucleotides comprising sample identifiers such as sample barcodes. The present techniques have a relatively small number of handling steps, which is desirable since handling steps increase the risk of contamination. Additionally, a pooling scheme and analysis method is provided which reduces the number of pools and sequencing runs required to detect contamination between samples. Instead of a large number of pools, the present method can reduce the number of pools used to detect contamination in a set of 96 sample identifiers. In some embodiments, two sequencing pools are used to detect sample identifier contamination in a set of 96 sample identifiers.

The present methods and compositions can also be used to amplify oligonucleotides (such as library molecules, adaptors, aptamers or other ssDNA molecules used to target proteins or peptides) which have a series of random nucleotides (which are considered sample identifiers herein) between two constant regions in order to detect sequence diversity, including detection of molecular barcodes. They could also be used to identify single nucleotide polymorphisms (SNPs) or sites of mutagenesis in known regions of DNA.

The oligonucleotides which may be assayed by the present methods include adaptors for nucleic acid molecules or regions from standard adaptors, such as the amplification region from a standard adaptor for a sequencing platform. The oligonucleotide can also include a label, tag, or other moiety. By way of example, the oligonucleotide can include a biotin moiety, allowing for enrichment of the oligonucleotides by binding to avidin or streptavidin. This approach is used in the commercially available HaloPlex kit (Agilent Technologies). The oligonucleotides which may be assayed by the present methods include library molecules, which are molecules prepared to be part of a library for a sequencing platform. A library molecule generally comprises an insert to which a sample identifier and one or more standard regions for sequencing platforms are attached. Other regions can also be included in a library molecule. With a library molecule, the sample identifier can be a molecular barcode, or it can be a second sample barcode that is in addition to a first sample barcode.

The methods also comprise amplifying the oligonucleotides or complements of the oligonucleotides with assay primers and a second primer. A different assay primer is used for each sample, and each assay primer comprises a priming portion and an assay identifier (such as a QC barcode), thereby generating a set of oligonucleotide amplicons. Each oligonucleotide amplicon comprises one of the assay identifiers, the 5′ constant region, one of the sample identifiers, and the 3′ constant region. The present assay methods can be readily adapted to various standardized sequencing platforms (for example, the Illumina and Ion Torrent sequencing platforms), by selecting constant regions that are standard for those platforms.

In some embodiments, the present methods detect sample identifier contamination at a level less than 1%, alternatively less than 0.5%, alternatively less than 0.1% using a small number of handling steps to avoid or prevent assay-induced contamination, and provide a method of pooling and analysis, such that a small number of sequencing runs is performed. The present disclosure provides a fast and relatively inexpensive method to prepare libraries from potentially contaminated oligonucleotides having sample identifiers. The libraries are adapted for sequencing, especially massively parallel sequencing, on one or more desired sequencing platforms.

In some embodiments, the oligonucleotide amplicons comprise a 5′ constant region and a 3′ constant region. Furthermore, the 5′ constant region comprises a standard 5′ adaptor for a sequencing platform and a sequencing priming region, an assay identifier, a middle constant region comprising a sequencing priming region, and a sample identifier, and the 3′ constant region comprises a standard 3′ adaptor for a sequencing platform. In some embodiments, the oligonucleotide amplicons comprise (i) a 5′ constant region comprising a standard 5′ adaptor for a sequencing platform and a sequencing priming region, (ii) an assay identifier, (iii) a middle constant region comprising a sequencing priming region, (iv) a sample identifier, and (v) a 3′ constant region comprising a standard 3′ adaptor for a sequencing platform. For example, a standard 5′ adaptor can comprise an Illumina P5 or P5′ sequence, and a standard 3′ adaptor can comprise an Illumina P7 or P7′ sequence. P7′ indicates the complement of P7; likewise, P5′ indicates the complement of P5. In other embodiments, the oligonucleotide amplicon comprises a 5′ constant region comprising a standard 5′ adaptor, a sample identifier, a middle constant region, an assay identifier, and a 3′ constant region comprising a standard 3′ adaptor.
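The five-part amplicon layout (i) to (v) lends itself to a simple in-silico parse. The following is a minimal sketch, assuming an Illumina-style amplicon in which the 5′ constant region is P5 followed by Read1, the middle constant region is Index1, and the 3′ constant region is P7′ (sequences as listed in Table 1). The function names are hypothetical, and exact string matching is used for brevity; real reads would typically need mismatch-tolerant matching.

```python
# Standard regions as listed in Table 1 of this disclosure.
P5 = "AATGATACGGCGACCACCGA"
READ1 = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
INDEX1 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
P7 = "CAAGCAGAAGACGGCATACGAGAT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_amplicon(amplicon: str):
    """Return (assay_identifier, sample_identifier), or None if a constant
    region cannot be located by exact matching (illustrative assumption)."""
    read1_pos = amplicon.find(READ1)
    if read1_pos < 0:
        return None
    assay_start = read1_pos + len(READ1)
    middle_pos = amplicon.find(INDEX1, assay_start)
    if middle_pos < 0:
        return None
    sample_start = middle_pos + len(INDEX1)
    three_prime_pos = amplicon.find(reverse_complement(P7), sample_start)
    if three_prime_pos < 0:
        return None
    return amplicon[assay_start:middle_pos], amplicon[sample_start:three_prime_pos]


if __name__ == "__main__":
    # Made-up identifiers, used only to exercise the parser.
    demo = P5 + READ1 + "ACGTACGTAC" + INDEX1 + "TTGACCAA" + reverse_complement(P7)
    print(split_amplicon(demo))
```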

The present methods, compositions and kits can also be used to modify an oligonucleotide comprising a region that is standard for a first sequencing platform (for example, an amplification region or a sequencing primer site (region)), so that it includes a region that is standard for a different sequencing platform. In some embodiments, a second primer comprises a 3′ region complementary to a 3′ constant region of the oligonucleotides, and the second primer further comprises a 5′ region comprising a standard amplification region, wherein the 3′ constant region of the oligonucleotides comprises a standard amplification region for a different sequencing platform than the standard amplification region of the 5′ region of the second primer.

The present disclosure also provides novel pooling and sequencing schemes for identifying contamination of sample identifiers and assay identifiers. In some embodiments, the present methods comprise pooling the oligonucleotide amplicons in at least two pools; sequencing the two pools to determine the sequences of at least portions of the oligonucleotide amplicons comprising the sample identifiers and the assay identifiers; determining whether the sample identifiers in the second pool include a contaminating sample identifier; and determining whether the assay identifiers in the second pool include a contaminating assay identifier. In some embodiments, the present methods further comprise determining a contaminating sample identifier by determining that the contaminating sample identifier is from a second pool. In some embodiments, the methods further comprise identifying a contaminating sample identifier by determining that the second pool does not include a contaminating assay identifier. In some embodiments, the present methods further comprise identifying a contaminating assay identifier by determining that the second pool does not include a contaminating assay identifier.
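The first step of such a two-pool comparison can be sketched as a set operation. The sketch below covers only the identification of identifiers that were observed in a first pool without being assigned to it, and whether each of them was assigned to the second pool; it does not attempt the further attribution between sample identifier and assay identifier contamination. The function names and identifier labels are hypothetical.

```python
def unassigned_identifiers(observed, assigned):
    """Identifiers observed in a pool that were not assigned to that pool."""
    return sorted(set(observed) - set(assigned))


def trace_contaminants(observed_pool1, assigned_pool1, assigned_pool2):
    """Map each contaminating identifier seen in pool 1 to its likely origin.

    Returns "pool 2" when the identifier was assigned to the second pool,
    otherwise "unknown" (illustrative labels, not from the disclosure).
    """
    origins = {}
    for identifier in unassigned_identifiers(observed_pool1, assigned_pool1):
        origins[identifier] = "pool 2" if identifier in assigned_pool2 else "unknown"
    return origins


if __name__ == "__main__":
    assigned_1 = {"SB01", "SB02"}
    assigned_2 = {"SB49", "SB50"}
    observed_1 = ["SB01", "SB02", "SB50", "SB99"]
    print(trace_contaminants(observed_1, assigned_1, assigned_2))
```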

In some embodiments, the present methods further comprise grouping sequences of the oligonucleotide amplicons according to the assay identifiers to form assay groups; and determining if there is more than one sample identifier sequence in each of the assay groups. In some embodiments, the present methods further comprise grouping sequences of the oligonucleotide amplicons according to the sample identifiers to form sample groups; and determining if there is more than one assay identifier sequence in each of the sample groups. In some embodiments, the methods comprise forming at least two pools from the oligonucleotide amplicons; sequencing at least two pools of amplicons to obtain sequence information of the oligonucleotide amplicons; wherein the sequence information for the individual oligonucleotide amplicon at least comprises the sequence of the assay identifier and the sample identifier. In some embodiments, the present methods can comprise grouping amplicon sequence information according to the assay identifier; and determining if grouped amplicon sequence information contains more than one of the sample identifiers.
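The grouping steps described above can be illustrated with a short sketch that takes (assay identifier, sample identifier) pairs already extracted from sequenced amplicons. The function names and the identifier labels are hypothetical, and the sketch assumes error-free identifier calls.

```python
from collections import defaultdict


def group_identifiers(pairs, by="assay"):
    """Group (assay_id, sample_id) pairs by one identifier and collect
    the set of partner identifiers seen with it."""
    if by not in ("assay", "sample"):
        raise ValueError("by must be 'assay' or 'sample'")
    groups = defaultdict(set)
    for assay_id, sample_id in pairs:
        if by == "assay":
            groups[assay_id].add(sample_id)
        else:
            groups[sample_id].add(assay_id)
    return dict(groups)


def mixed_groups(groups):
    """Keep only the groups that contain more than one partner identifier."""
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    pairs = [("QC01", "SB01"), ("QC01", "SB01"), ("QC02", "SB02"), ("QC02", "SB01")]
    print(mixed_groups(group_identifiers(pairs, by="assay")))
    print(mixed_groups(group_identifiers(pairs, by="sample")))
```

In this made-up input, assay group QC02 contains two sample identifiers and sample group SB01 contains two assay identifiers, so both views flag the same unexpected pairing.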

The methods can comprise determining if there is a mismatch between an assay identifier and a sample identifier, such as where at least one of the sample identifiers is associated with an assay identifier that it should not be associated with, and/or where at least one of the assay identifiers is associated with a sample identifier that it should not be associated with.
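A mismatch determination of this kind can be sketched by comparing observed pairs against the intended pairing of assay identifiers and sample identifiers, and reporting each unexpected pair as a fraction of the reads carrying that assay identifier. The `expected` mapping, the function name, and the read counts are illustrative assumptions rather than data from the disclosure.

```python
from collections import Counter


def mismatch_fractions(pairs, expected):
    """Return {(assay_id, sample_id): fraction} for pairs that deviate from
    `expected`, a mapping of assay_id to its intended sample_id.

    Each fraction is relative to all reads carrying the same assay identifier.
    """
    pairs = list(pairs)
    reads_per_assay = Counter(assay_id for assay_id, _ in pairs)
    reads_per_pair = Counter(pairs)
    return {
        (assay_id, sample_id): count / reads_per_assay[assay_id]
        for (assay_id, sample_id), count in reads_per_pair.items()
        if expected.get(assay_id) != sample_id
    }


if __name__ == "__main__":
    expected = {"QC01": "SB01", "QC02": "SB02"}
    reads = [("QC01", "SB01")] * 999 + [("QC01", "SB02")] + [("QC02", "SB02")] * 500
    for pair, fraction in mismatch_fractions(reads, expected).items():
        print(pair, f"{fraction:.2%}")
```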

The present methods can be used with sample preparation kits for NGS. They can also be used with library preparation reagents. The present methods can also be employed to assay target enrichment kits and sets that contain sample barcodes or other identifiers, including SureSelect reagent kits. SureSelect kits (available from Agilent Technologies) contain oligonucleotides having a sample identifier and having one or more constant regions 5′ and 3′ to the sample identifier, namely PCR primers.

The present disclosure allows for the production of sample identifier sets or kits that are substantially free of contamination, such as having less than 0.1% of a contaminating sample identifier, or less than 0.01%.

In FIG. 1A, an oligonucleotide 102 comprises a 5′ constant region 110, a sample identifier 112, and a 3′ constant region 114. For example, the 5′ constant region 110 can comprise a standard sequence such as an Illumina Index 1 sequence, the sample identifier 112 can be a sample barcode, and the 3′ constant region 114 can comprise a standard amplification sequence, such as the Illumina P7′ sequence. The constant regions can comprise any standard priming site (region) for amplification or sequencing. The oligonucleotide 102 is amplified using a primer 104 having a priming region 115 complementary to at least a portion of the 3′ constant region 114. For example, the primer 104 can be a P7 primer. In the same step or a subsequent step, the oligonucleotide 102 or complement thereof is amplified with a primer 106 having a priming region 120 complementary to at least a portion of the 5′ constant region 110 or its complement 111. The primer also comprises an assay identifier 122 and one or more constant regions 126, 124 (for example an Illumina P5 sequence 126 and a Read 1 sequencing primer 124). Additional rounds of amplification produce oligonucleotide amplicons 108 comprising one or more constant regions 126, 124, the assay identifier 122, the sequence of the 5′ constant region 120 of the initial oligonucleotide, the sample identifier sequence 128, and the 3′ constant region 130 of the initial oligonucleotide. The sample identifier sequence 128 in the amplicons 108 is generally an identical copy of the sample identifier 112 of the oligonucleotide 102. Constant region 120 of the amplicon 108 will be mostly identical to constant region 110 of the oligonucleotide 102; however, either could be partially truncated. For example, constant region 110 could be truncated on the 5′ end, and constant region 120 could be truncated on the 3′ end. Likewise, constant region 130 of the amplicon 108 and constant region 114 of the oligonucleotide 102 will generally be the same, though constant region 114 could be partially truncated on the 3′ end, and constant region 130 could be partially truncated on the 5′ end. The oligonucleotide amplicons 108 are adapted for sequencing on a standard platform for massively parallel sequencing due to the constant regions.
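The arrangement of FIG. 1A can be modeled in silico as string concatenation. The sketch below assumes the example regions named above (an Index 1 sequence as the 5′ constant region and P7′ as the 3′ constant region, with sequences from Table 1), and further assumes that the priming region of the assay primer has the same sequence as the full, untruncated 5′ constant region. The function names and the example identifiers are hypothetical.

```python
# Standard regions as listed in Table 1 of this disclosure.
P5 = "AATGATACGGCGACCACCGA"
READ1 = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
INDEX1 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
P7 = "CAAGCAGAAGACGGCATACGAGAT"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_assay_primer(assay_identifier: str) -> str:
    """Constant regions (P5, Read1), then the assay identifier, then a
    priming region matching the 5' constant region (assumed untruncated)."""
    return P5 + READ1 + assay_identifier + INDEX1


def expected_amplicon(oligonucleotide: str, assay_identifier: str) -> str:
    """Amplicon expected after extension of the assay primer across the
    sample identifier and the 3' constant region."""
    if not oligonucleotide.startswith(INDEX1):
        raise ValueError("oligonucleotide lacks the expected 5' constant region")
    downstream = oligonucleotide[len(INDEX1):]  # sample identifier + 3' constant region
    return build_assay_primer(assay_identifier) + downstream


if __name__ == "__main__":
    # Made-up 8-base sample identifier and 10-base assay identifier.
    oligo = INDEX1 + "TTGACCAA" + reverse_complement(P7)
    print(expected_amplicon(oligo, "ACGTACGTAC"))
```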

FIG. 1B shows another embodiment of the present methods. In this embodiment, oligonucleotide 103 comprises a 3′ constant region 111, a sample identifier 113, and a 5′ constant region 115. For example, the 3′ constant region 111 can be the Illumina Read 2 sequence (or another standard region for a sequencing platform), and the 5′ constant region 115 can be an Illumina P7 sequence or any standard priming site (region) for amplification or sequencing. Amplification produces oligonucleotide amplicons 109 comprising one or more constant regions 127, 125, the assay identifier 123, the 3′ constant region 111, the sample identifier sequence 113, and the sequence of the 5′ constant region 115. Additional rounds of amplification can be conducted with primer 131, which has the same sequence as a portion of constant region 115 sufficient to function as a primer.

FIG. 1C demonstrates how the assay method can be performed when the initial oligonucleotide is a library molecule, that is, a molecule comprising an insert to which a sample identifier and standard regions for sequencing platforms are attached. In this embodiment, the assay method can detect contamination that occurred during the library preparation. The oligonucleotide 102 comprises a first 5′ constant region 110, a sample identifier 112, a 3′ constant region 114, and further comprises an insert 140, a second 5′ constant region 142 (such as a Read 1 priming site), an optional second sample identifier 144, and a third 5′ constant region 146 (for example, an amplification priming site). The insert 140 comprises a target sequence to be studied, analyzed or subjected to additional testing, such as sequencing on a massively parallel sequencing platform. A second sample identifier 144 is optionally included in many library preparations. Oligonucleotide 103 (which is a complementary strand of oligonucleotide 102) comprises a first 3′ constant region 111, a sample identifier 113, a 5′ constant region 115, and further comprises an insert 141, a second 3′ constant region 143 (such as a Read 1 priming site), an optional second sample identifier 145, and a third 3′ constant region 147 (for example, an amplification priming site). The oligonucleotide 102 is amplified using an assay primer 104 having a priming region 115 complementary to at least a portion of the 3′ constant region 111. For example, the primer 104 can be a P7 primer. In the same step or a subsequent step, the oligonucleotide 102 or complement thereof is amplified with a primer 106 having a priming region 120 complementary to at least a portion of the 5′ constant region 110 or its complement 111. The primer also comprises an assay identifier 122 and one or more constant regions 126, 124 (for example an Illumina P5 sequence 126 and a Read 1 sequencing primer region 124). Additional rounds of amplification produce oligonucleotide amplicons 108 comprising one or more constant regions 126, 124, the assay identifier 122, the sequence of the 5′ constant region 120 of the initial oligonucleotide, the sample identifier sequence 128, and the 3′ constant region 130 of the initial oligonucleotide. In the embodiment shown, oligonucleotide amplicon 108 does not include insert 140, but in some embodiments, primer 106 binds a region 3′ to the insert 140, and the insert 140 is thereby included in the amplicons. A pooling method (as described in Example 4) can be employed on a library prepared with two or more sample barcodes (where the barcodes are attached via either ligation or amplification) and the pooling method can be used to identify if sample barcode contamination occurred after the library preparation was performed.

By the selection of constant regions and priming regions on the assay primers, this method is adaptable for different library preparation methods (including HaloPlex XTHS, HaloPlex HS, SureSelect XT, and SureSelect QXT, all from Agilent) and different standardized sequencing platforms (including Illumina and Ion Torrent). Sequencing platforms for massively parallel sequencing include Ion Torrent PGM and Proton semiconductor sequencers, and Illumina MiSeq, HiSeq, MiniSeq, and NextSeq. Other sequencing platforms are in development and the present compositions and methods can be used with the standard amplification regions for those platforms.

In some embodiments, constant regions on the oligonucleotide and/or the assay primer comprise sequences suitable for use on a standardized sequencing platform. For example, a constant region can have the sequence of an amplification region for an Illumina sequencing platform, such as an Illumina P5 sequence or an Illumina P7 sequence, or of an amplification region for an Ion Torrent sequencing platform, such as an Ion Torrent Adapter A sequence or an Ion Torrent Adapter P1 sequence, or of a sequencing primer region, such as Illumina Read1, Index1, Read2 or Index2. Other amplification regions or sequencing primer regions can be used for different platforms. Table 1 sets forth the sequences of standard regions currently used in Illumina and Ion Torrent sequencing platforms:

TABLE 1

Illumina P5       5′-AATGATACGGCGACCACCGA-3′ (SEQ ID NO: 1)
Illumina P7       5′-CAAGCAGAAGACGGCATACGAGAT-3′ (SEQ ID NO: 2)
Illumina Read1    5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3′ (SEQ ID NO: 3)
Illumina Index1   5′-GATCGGAAGAGCACACGTCTGAACTCCAGTCAC-3′ (SEQ ID NO: 4)
Illumina Read2    5′-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3′ (SEQ ID NO: 5)
Illumina Index2   5′-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3′ (SEQ ID NO: 6)
IonTorrent A      5′-CCATCTCATCCCTGCGTGTCTCCGACTCAG-3′ (SEQ ID NO: 7)
IonTorrent P1     5′-CCTCTCTATGGGCAGTCGGTGAT-3′ (SEQ ID NO: 8)

In some embodiments, a constant region of an oligonucleotide comprises a sequence selected from the sequences set forth in Table 1.
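Because several of the present embodiments refer to reverse complements of the standard regions (for example P7′), a small helper for computing them may be useful. The sketch below is illustrative only; the function name is hypothetical. As a consistency check on the sequences as listed in Table 1, it also confirms that the listed Index2 sequence is the reverse complement of the listed Read1 sequence, and that the listed Index1 sequence is the reverse complement of the listed Read2 sequence with its first base omitted.

```python
# Sequences copied from Table 1 of this disclosure.
TABLE_1 = {
    "Illumina P5": "AATGATACGGCGACCACCGA",
    "Illumina P7": "CAAGCAGAAGACGGCATACGAGAT",
    "Illumina Read1": "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "Illumina Index1": "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "Illumina Read2": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
    "Illumina Index2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
    "IonTorrent A": "CCATCTCATCCCTGCGTGTCTCCGACTCAG",
    "IonTorrent P1": "CCTCTCTATGGGCAGTCGGTGAT",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print("P7' =", reverse_complement(TABLE_1["Illumina P7"]))
    print(reverse_complement(TABLE_1["Illumina Read1"]) == TABLE_1["Illumina Index2"])
    print(reverse_complement(TABLE_1["Illumina Read2"])[1:] == TABLE_1["Illumina Index1"])
```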

FIG. 5 shows how the present methods and compositions can be used to add an assay identifier at a 3′ location relative to the sample identifier. This approach is especially suitable for oligonucleotides which are adapters configured for attachment to 5′ ends of target molecules to be sequenced or primers intended to amplify the 5′ end of target molecules. Thus, in this embodiment, the present methods are particularly suited for detecting identifiers present in a 5′ adaptor (an alternative to the 3′ adaptor shown in FIG. 1).

In FIG. 5, an oligonucleotide 502 comprises a 5′ constant region 510, a sample identifier 512, and a 3′ constant region 514. For example, the 5′ constant region 510 can be an Illumina P5 sequence, the sample identifier 512 can be a sample barcode, and the 3′ constant region 514 can be an Illumina Read 1 sequence. The oligonucleotide 502 is amplified using a primer 504 having a priming region 515 complementary to at least a portion of the 3′ constant region 514. For example, the priming region 515 can be the reverse complement of the 3′ constant region 514, that is, the reverse complement of an Illumina Read 1 sequence. Primer 504 also comprises an assay identifier 517 and an adapter 519 for a sequencing platform or its complement, for example the reverse complement of Illumina P7 (P7′). The oligonucleotide 502 or complement thereof is amplified with a primer 506 having a priming region 520 complementary to at least a portion of the 5′ constant region 510 or its complement. Additional rounds of amplification produce oligonucleotide amplicons 508 comprising a 3′ adapter 518, the assay identifier 522, 516, the 3′ constant region 514, the sample identifier 512, and 5′ constant region 520. The oligonucleotide amplicons 508 are adapted for sequencing on a standard platform for massively parallel sequencing because at least one, and often both, constant regions include an adapter for such a platform.

FIG. 6 shows how the present assay methods and compositions can be used to detect contamination in the oligonucleotides when they are surrounded by two constant regions, and neither of those constant regions is compatible with the sequencing platform to be used for the assay. Alternatively, this approach can be used to convert adaptors and primers from one sequencing platform so that they can be sequenced on another platform. For example, the oligonucleotides such as adaptors used in an Ion Torrent HaloPlex assay can be assayed using an assay primer containing: Illumina P5, QXT Read1, QC index, IonTorrent Read primers; and an amplification primer containing: Illumina P7 and the reverse complement to the Haloplex dark bases (dark bases are those that do not generate the fluorescence associated with nucleotide incorporation during sequencing). This allows these primers to be assayed for contamination on an Illumina sequencer. This approach can also be used to allow sequencing of oligonucleotides that are not intended for sequencing and do not include amplification regions for sequencing platforms, provided those oligonucleotides comprise a 5′ constant region, an unknown region, and a 3′ constant region.

In FIG. 6, an oligonucleotide 602 comprises a 5′ constant region 610, a sample identifier 612, and a 3′ constant region 614. In this embodiment, the constant regions 610, 614 of oligonucleotide 602 are for a first sequencing platform, such as an Ion Torrent sequencing platform, but it is desired to sequence the oligonucleotide 602 on a second sequencing platform, such as an Illumina sequencing platform. For example, the 5′ constant region 610 can be an Ion Torrent Adapter A sequence, the sample identifier 612 can be a sample barcode, and the 3′ constant region 614 can be dark bases provided to allow for ligation and quality control. The oligonucleotide 602 is amplified using a primer 604 having a priming region 615 complementary to at least a portion of the 3′ constant region 614 (that is, complementary to at least a portion of the dark bases). Primer 604 also comprises a region 617 corresponding to a standard amplification region for a sequencing platform, for example, an Illumina P7 sequence. The oligonucleotide 602 or complement thereof is amplified with a primer 606 having a priming region 620 complementary to at least a portion of the 5′ constant region 610 or its complement 611. The primer 606 also comprises an assay identifier 622 and one or more constant regions (for example an Illumina P5 sequence 626 and an Illumina Read 1 sequence 624). Amplification continues with primer 606 and primer 604 using suitable amplification cycles to provide oligonucleotide amplicons suitable for sequencing on an Illumina sequencing platform. Additional rounds of amplification produce oligonucleotide amplicons 608 comprising one or more constant regions 626, 624, the assay identifier 622, the sequence 620 of the 5′ constant region 610 of the initial oligonucleotide 602, the sample identifier 612, the sequence of the 3′ constant region 614 and an amplification region 628. The oligonucleotide amplicons 608 are adapted for sequencing on a standard platform for massively parallel sequencing due to the constant regions 626 and/or amplification region 628.

In some embodiments, the presence of a complementary DNA strand (as in the case of an adaptor) may cause problems with detecting contamination or sequence variation, if the complementary adaptor strand contains both of the binding regions for amplification primers. In such situations, both strands will be amplified, and any detected contamination or sequence variation could be due to differences between the barcode sequences present on the two strands. In many cases, the adaptor design is such that this will not occur.

EXAMPLE 1

An embodiment of the present methods is employed to determine whether there is sample barcode contamination in a kit having Illumina adapter sequences. As shown in FIG. 1A, an oligonucleotide 102 having a sample identifier 112 is flanked by Illumina Index1 sequence as its 5′ constant region 110, and an Illumina P7′ sequence as its 3′ constant region 114. P7′ indicates the complement of P7; likewise, P5′ indicates the complement of P5. FIG. 1 illustrates a method for detecting contamination of this oligonucleotide 102 with oligonucleotides having a different sample identifier. Amplification can be performed using a standard DNA polymerase, a P7 primer, and another primer containing P5, a Read 1 Primer sequence, a QC barcode and Index 1 sequence (from 5′ to 3′, respectively). A high fidelity DNA polymerase can be used to reduce or minimize erroneous contamination detection due to PCR errors.
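The amplicon layout implied by this example (P5, Read 1, QC barcode, Index 1, sample barcode, P7′, from 5′ to 3′) can be sketched in Python as follows. This is a minimal illustration only: the barcode sequences and the 8-nucleotide barcode length are hypothetical placeholders, the function names `build_amplicon` and `extract_identifiers` are introduced here, and P7′ is taken to be the reverse complement of the P7 sequence of Table 1.

```python
# Constant regions from Table 1 (5' to 3').
P5 = "AATGATACGGCGACCACCGA"
P7 = "CAAGCAGAAGACGGCATACGAGAT"
READ1 = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
INDEX1 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_amplicon(qc_barcode, sample_barcode):
    """Assemble the expected amplicon: P5, Read 1, QC barcode, Index 1, sample barcode, P7'."""
    return P5 + READ1 + qc_barcode + INDEX1 + sample_barcode + reverse_complement(P7)


def extract_identifiers(amplicon, qc_len, sample_len):
    """Locate Index 1 and return (QC barcode, sample barcode), or None if not found."""
    start = amplicon.find(INDEX1)
    if start < qc_len:  # also covers start == -1 (Index 1 absent)
        return None
    qc_barcode = amplicon[start - qc_len:start]
    sample_start = start + len(INDEX1)
    sample_barcode = amplicon[sample_start:sample_start + sample_len]
    if len(sample_barcode) < sample_len:
        return None
    return qc_barcode, sample_barcode


# Hypothetical 8-nt barcodes, chosen only for illustration.
amplicon = build_amplicon("ACGTACGT", "TTGCAAGC")
pair = extract_identifiers(amplicon, qc_len=8, sample_len=8)
print(pair)
```

Anchoring the extraction on the Index 1 constant region mirrors the structure of the amplicon, in which the QC barcode and the sample barcode sit immediately on either side of that region.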

Two versions or embodiments of the assay primers were used to develop the assay. The sequences of these two versions are shown in FIG. 2. Initial attempts using version 1 of the assay primer, which contains both the Illumina Read 1 primer and the reverse complement of Illumina Read 2 (Index 1) primer sequence in the assay primer, resulted in a small amount of the expected 130 bp amplicon and a large amount of shorter amplification products (Lane B1 in FIG. 3). These products potentially come from secondary amplification products that are created due to the 13 bp complementarity between the 3′ end of Read 1 and the 5′ end of Index 1. By changing the sequence of Read 1 from the Illumina sequence to the QXT Read 1 sequence (version 2 of the assay primer), these secondary amplification products were largely eliminated (Lane B1 in FIG. 4).
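The 13 bp complementarity mentioned above can be reproduced with a simple string comparison. The Python sketch below is illustrative only: it uses the Read1 and Read2 sequences of Table 1, takes the full reverse complement of Read 2 as the Index 1 primer sequence (as described for version 1 of the assay primer), and introduces the helper name `three_prime_overlap`. It checks exact base pairing at the ends and is not a thermodynamic prediction of primer interactions.

```python
READ1 = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
READ2 = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def three_prime_overlap(a, b):
    """Length of the longest 3' end of `a` that can base-pair with the 5' end of `b`.

    The 3' suffix of `a` pairs with the 5' prefix of `b` when the reverse
    complement of that suffix equals the prefix.
    """
    for k in range(min(len(a), len(b)), 0, -1):
        if reverse_complement(a[-k:]) == b[:k]:
            return k
    return 0


index1_primer = reverse_complement(READ2)
overlap = three_prime_overlap(READ1, index1_primer)
print(overlap)  # 13
```

The same function could be applied to a candidate replacement for Read 1 to check that the end complementarity has been removed before ordering a redesigned assay primer.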

EXAMPLE 2

Haloplex and Haloplex HS Kits were tested to see if the oligonucleotide containing the sample barcodes could be amplified in the index solution supplied in the kits. It was found that the oligonucleotides could be cleanly amplified, as a strong amplification product was generated when using the assay primer (FIG. 4, lane B1 (supplied index solution)).

EXAMPLE 3

Assay primers were tested with SureSelect XT and SureSelect XT2 reagent kits, and oligonucleotides were successfully amplified. The present assay primers were also used to test SureSelect XTHS reagent kits, with modifications to the overlap sequence, and oligonucleotides were successfully amplified.

Amplification of these libraries can occur even when the oligonucleotide is modified in a way to prevent elongation, as subsequent rounds after the first two rounds use the synthesized molecule as a template. The amplification method also works in the presence of 5′ biotin modifications.

EXAMPLE 4

A set of 96 or more sample identifiers is provided. The set can be used to add sample identifiers to nucleic acids prior to amplification and/or prior to pooling before sequencing. However, if contamination occurred in one of these sample identifiers during kit assembly or reagent preparation, it could cause the spurious detection of a low allele frequency variant in a sample. Confirming that every sample identifier is free of contamination by sequencing each one separately would take a large number of sequencing runs.

The following scheme overcomes this limitation and can be used to determine contamination of sample identifiers (also referred to as sample barcodes or SBCs in this example) and/or assay identifiers (also referred to as QC barcodes or QCBCs in this example). A set of 96 oligonucleotides containing different sample identifiers is split into two groups: Group 1 and Group 2, each containing 48 of the oligonucleotides. Group 1 has SBC1 to SBC48, and Group 2 has SBC49 to SBC96. Each sample identifier in Group 1 is amplified with an assay primer containing one of 48 different assay identifiers (QCBC1 to QCBC48). Each sample identifier in Group 2 is amplified with one of the same 48 assay identifiers that was used in Group 1, such that every assay identifier (QCBC1 through QCBC48) is present in both Groups and in two amplification reactions, and every sample identifier (SBC1 through SBC96) is present in only one Group and in one amplification reaction. The association of assay identifiers (QCBCs) with sample identifiers (SBCs) according to the scheme is shown in FIG. 7. For illustrative purposes, the SBCs are shown as being arranged in a 96-well plate, though they do not have to be provided or used in well plates.
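The pairing scheme can be written down as a small lookup table. The Python sketch below is an illustration under one assumption: that identifiers are paired in numeric order within each group (SBC1 with QCBC1, SBC49 with QCBC1, and so on), which is consistent with the pairings used as examples in this description (SBC10 with QCBC10, SBC61 with QCBC13). The function name `expected_pairing` is introduced here for illustration.

```python
def expected_pairing(n_sample_ids=96, n_assay_ids=48):
    """Map each SBC number to its (group, QCBC number) under numeric-order pairing."""
    pairing = {}
    for sbc in range(1, n_sample_ids + 1):
        group = (sbc - 1) // n_assay_ids + 1  # SBC1-48 in Group 1, SBC49-96 in Group 2
        qcbc = (sbc - 1) % n_assay_ids + 1    # QCBC1-48 reused in each group
        pairing[sbc] = (group, qcbc)
    return pairing


pairing = expected_pairing()

# Each assay identifier is used exactly twice, once per group.
uses_per_qcbc = {}
for group, qcbc in pairing.values():
    uses_per_qcbc.setdefault(qcbc, set()).add(group)

print(pairing[10], pairing[61])  # (1, 10) (2, 13)
```

Because each QCBC appears once in each group and each SBC appears in exactly one group, every expected (SBC, QCBC) association within a pool is unique, which is what allows pooled sequencing to be interpreted.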

PCR amplification produces oligonucleotide amplicons having a QCBC and an SBC. In the absence of contamination, each SBC is associated with one QCBC. In other words, when sequenced, the sequence information for each SBC should have a single QCBC associated with it. FIG. 7 shows the associations that will be produced using this scheme. However, it is desirable to sequence amplicons in pools rather than individually using massively parallel sequencing, thereby reducing time, expense, and effort required for sequencing. Thus, the oligonucleotide amplicons generated in Group 1 are pooled together and sequenced, and the oligonucleotide amplicons from Group 2 are pooled together and sequenced. The sequencing of the pools produces sequence information for the various amplicons included in the pools, and the sequencing information for a given amplicon will have a sample identifier and an assay identifier associated with it.

Sequencing in this manner will allow for the detection of contamination due to sample identifiers or assay identifiers based on the associations identified after analysis of the sequence information. For this analysis, it is helpful to include all the potential sample identifiers (whether they are intended to be present in the pool or not) in the analysis of the sequencing information. If contamination occurs, it can be from the sample identifier or the assay primer. The pattern in which sample identifiers and assay identifiers appear in the two sequencing pools (from Group 1 and Group 2) will determine whether it is sample identifier contamination or assay identifier contamination. The present scheme allows one to determine which is the source of the contamination.

If a sample identifier from Group 2 is observed in Group 1 (for example, if the sequence of SBC66 is found in the sequencing information for Group 1), this indicates contamination of one of the sample barcodes in Group 1, as there are 49 sample identifiers rather than the expected 48. However, this knowledge alone does not indicate which of the sample identifiers in Group 1 was contaminated with SBC66. The specific sample barcode contaminated is determined based on which assay identifier is associated with the contaminating SBC66. If the SBC66 found in the first pool is associated with QCBC10, then SBC10 is the sample identifier that was contaminated with SBC66. In general, the contaminated sample identifier is the one in Group 1 that has the same assay identifier associated with it as the contaminating sample identifier.

Additionally, the present methods, compositions and kits can also detect contamination within a pool by identifying sample identifiers that are associated with more than one assay identifier and/or by identifying assay identifiers that are associated with more than one sample identifier. If sequence information indicates the presence of amplicons having SBC13 and QCBC13, as well as amplicons having SBC13 and QCBC29 (that is, SBC13 is associated with QCBC13 and with QCBC29), this indicates there is some contamination. However, this knowledge alone does not indicate whether SBC29 was contaminated with SBC13, or whether QCBC13 was contaminated with QCBC29. By identifying whether there is contamination of the same assay identifier in the second pool, one can identify the source of contamination. In the second pool, SBC61 will only be associated with QCBC13 in the absence of contamination. However, if SBC61 is also associated with QCBC29, this indicates that QCBC13 was contaminated, since the contamination occurred in both pools. If SBC61 is not associated with QCBC29, then QCBC13 is not contaminated, and the SBC29 sample (contaminated with SBC13) was the source of the contamination observed in the first pool. The same approach also works for Group 1 sample identifiers present in the Group 2 pool. The present methods provide the ability to differentiate between contamination of a sample identifier and contamination of an assay identifier using two sequencing pools.
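The decision logic of this example can be sketched in Python as follows. This is a simplified illustration, not a definitive implementation: it assumes the numeric-order pairing used in the examples above (SBC10 with QCBC10, SBC61 with QCBC13), represents each pool as a set of observed (SBC, QCBC) associations, ignores read counts and thresholds, and uses hypothetical names (`expected`, `sbc_for`, `classify`).

```python
N_QCBC = 48  # assay identifiers QCBC1 to QCBC48, reused in both pools


def expected(sbc):
    """Return the (pool, QCBC) expected for a sample identifier number (1 to 96)."""
    return (sbc - 1) // N_QCBC + 1, (sbc - 1) % N_QCBC + 1


def sbc_for(pool, qcbc):
    """Return the sample identifier expected to carry `qcbc` in `pool`."""
    return (pool - 1) * N_QCBC + qcbc


def classify(observed):
    """Classify unexpected associations.

    `observed` maps pool number (1 or 2) to a set of (SBC, QCBC) pairs.
    Returns a sorted list of findings:
      ("sample", contaminated_sbc, contaminating_sbc) or
      ("assay", contaminated_qcbc, contaminating_qcbc).
    """
    findings = set()
    for pool, pairs in observed.items():
        other = 2 if pool == 1 else 1
        for sbc, qcbc in pairs:
            home_pool, home_qcbc = expected(sbc)
            if home_pool != pool:
                # SBC from the other group: the sample sharing this QCBC is contaminated.
                findings.add(("sample", sbc_for(pool, qcbc), sbc))
            elif qcbc != home_qcbc:
                # Ambiguous within one pool; check the partner reaction in the other pool.
                partner = sbc_for(other, home_qcbc)
                if (partner, qcbc) in observed.get(other, set()):
                    findings.add(("assay", home_qcbc, qcbc))
                else:
                    findings.add(("sample", sbc_for(pool, qcbc), sbc))
    return sorted(findings)


def clean_pools():
    """Expected associations for both pools in the absence of contamination."""
    return {p: {(sbc_for(p, q), q) for q in range(1, N_QCBC + 1)} for p in (1, 2)}


# SBC66 found in pool 1 with QCBC10: SBC10 was contaminated with SBC66.
case_cross = clean_pools()
case_cross[1].add((66, 10))

# SBC13 with QCBC29 in pool 1 only: the SBC29 sample was contaminated with SBC13.
case_sample = clean_pools()
case_sample[1].add((13, 29))

# SBC13 with QCBC29 in pool 1 and SBC61 with QCBC29 in pool 2: QCBC13 was contaminated.
case_assay = clean_pools()
case_assay[1].add((13, 29))
case_assay[2].add((61, 29))

print(classify(case_cross), classify(case_sample), classify(case_assay))
```

In practice the input sets would be derived from read counts above a chosen threshold; that step is outside the scope of this sketch.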

The present methods and compositions can also be used to determine sequence variation of random nucleotides found between two constant regions. The assay identifier can act as a standard sample barcode and only one pool of samples would be required, assuming sequencing output is sufficient to detect the level of contamination desired. For instance, this assay can be used to identify low levels of contamination occurring in sequences where a small variable region exists between two constant regions, and may be beneficial for identifying contamination or variation in oligonucleotides used for any intended application.

The foregoing description of exemplary or preferred embodiments should be taken as illustrating, rather than as limiting, the present invention, which is defined by the claims. As will be readily appreciated, numerous variations and combinations of the features set forth above can be utilized without departing from the present invention as set forth in the claims. Such variations are not regarded as a departure from the scope of the invention, and all such variations are intended to be included within the scope of the following claims. All references cited herein are incorporated by reference in their entireties.

We claim:
 1. A kit for an assay for a set of oligonucleotide samples, the kit comprising: a set of oligonucleotide samples comprising oligonucleotides, each oligonucleotide having a 5′ constant region, a sample identifier, and a 3′ constant region, wherein each sample identifier is unique within the set; and a set of assay primers comprising a priming portion and an assay identifier, wherein the priming portion is the same as or complementary to one of the constant regions of the oligonucleotides, wherein each assay identifier is unique within the set.
 2. The kit of claim 1, wherein each of the oligonucleotide samples of the set is in a separate vessel, and each vessel comprises only one sample identifier unless one or more of the samples is contaminated.
 3. The kit of claim 1, wherein the set of oligonucleotide samples comprises at least 8 samples.
 4. The kit of claim 3, wherein the set of assay primers comprises at least 8 assay primers.
 5. The kit of claim 1, wherein the set of oligonucleotide samples comprises at least 32 samples.
 6. The kit of claim 5, wherein the set of assay primers comprises at least 32 assay primers.
 7. The kit of claim 1, wherein the set of oligonucleotide samples comprises at least 96 samples.
 8. The kit of claim 7, wherein the set of assay primers comprises at least 96 assay primers.
 9. The kit of claim 1, wherein the assay primers further comprise a 5′ constant region comprising a standard 5′ amplification region for a sequencing platform and a sequencing priming region.
 10. The kit of claim 9, wherein the standard 5′ amplification region comprises a P5 sequence or a P7 sequence.
 11. The kit of claim 1, wherein the 5′ constant region of the oligonucleotides comprises a sequencing priming region.
 12. The kit of claim 1, wherein the 3′ constant region of the oligonucleotides comprises a standard 3′ amplification region for a sequencing platform.
 13. The kit of claim 1, wherein the assay identifies contamination in the set of oligonucleotide samples.
 14. A kit for an assay for a set of oligonucleotide samples comprising oligonucleotides, each oligonucleotide having a 5′ constant region, a sample identifier, and a 3′ constant region, wherein each sample identifier is unique within the set, the kit comprising: an assay primer comprising a priming portion and an assay identifier, wherein the priming portion is the same as or complementary to one of the constant regions of the oligonucleotides, wherein: each assay identifier is unique within the set, and the set comprises at least 8 assay primers in separate vessels.
 15. The kit of claim 14, wherein the set of assay primers comprises at least 16 assay primers in separate vessels.
 16. The kit of claim 14, wherein the set of assay primers comprises at least 32 assay primers in separate vessels.
 17. The kit of claim 14, wherein the set of assay primers comprises at least 48 primers in separate vessels.
 18. The kit of claim 14, wherein the set of assay primers comprises at least 96 primers in separate vessels.
 19. The kit of claim 14, wherein the assay primers further comprise a 5′ constant region comprising a standard 5′ amplification region for a sequencing platform and a sequencing priming region.
 20. The kit of claim 14, wherein the assay identifies contamination in sets of oligonucleotides comprising sample identifiers.